Objectives

  • Demonstrate lubridate for working with dates in R and ggplot2
  • Define correlation
  • Review how to implement smoothing lines
  • Generate and interpret scatterplot matricies and correlation heatmaps
  • Introduce and generate parallel coordinate plots
  • Identify methods for implementing three-dimensional graphs in R
library(tidyverse)
library(ggthemes)
library(knitr)
library(broom)
library(stringr)
library(modelr)
library(plotly)

options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())

Working with dates in R

For more details on the lubridate package, check out R for Data Science.

When importing data, date variables can be somewhat tricky to correctly store and utilize. In a spreadsheet, tabular format, dates by default will appear either as numeric (20174018) or string (2016-04-18, April 18th, 2017, etc.) columns. If you want to perform tasks such as: extracting and summarizing over individual components (year, month, day, etc.), we need to represent dates in a different, yet standardized, format.

lubridate is a tidyverse package that facilitates working with dates (and date-times) in R.

library(lubridate)

Formatting dates

When using readr to import data files, R will use parse_date() or parse_datetime() to try and format any columns it thinks contain dates or date-times. To manually format dates from strings, use the appropriate function combining y, m, and d in the proper order depending on the original format of the date:

ymd("2017-01-31")
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"

Extracting date components

Let’s practice extracting components of dates using an example dataset. flights-departed.csv is a time series data file containing the daily number of departing commercial flights in the United States from 1988-2008.

(flights <- read_csv("data/flights-departed.csv"))
## # A tibble: 7,671 × 2
##          date value
##        <date> <int>
## 1  1988-01-01 12681
## 2  1988-01-02 13264
## 3  1988-01-03 13953
## 4  1988-01-04 13921
## 5  1988-01-05 13932
## 6  1988-01-06 13157
## 7  1988-01-07 11159
## 8  1988-01-08 11631
## 9  1988-01-09 12045
## 10 1988-01-10 13160
## # ... with 7,661 more rows

We will use ggplot2 to generate several graphs based on the data. The first is a simple line plot over time of the daily commercial flights. To build this, we don’t need to modify flights:

ggplot(flights, aes(date, value)) +
  geom_line() +
  labs(x = NULL,
       y = "Number of departing commercial flights")

But this is quite noisy. Instead, let’s draw a line plot depicting commercial flights over a one-year period, with separate lines for each year in the data (1988, 1989, 1990, etc.). To do that, we need to create a new variable year which will serve as our grouping variable in ggplot():

(flights <- flights %>%
  mutate(year = year(date),
         yday = yday(date),
         # hack to label the x-axis with months
         days = dmy(format(date,"%d-%m-2016"))))
## # A tibble: 7,671 × 5
##          date value  year  yday       days
##        <date> <int> <dbl> <dbl>     <date>
## 1  1988-01-01 12681  1988     1 2016-01-01
## 2  1988-01-02 13264  1988     2 2016-01-02
## 3  1988-01-03 13953  1988     3 2016-01-03
## 4  1988-01-04 13921  1988     4 2016-01-04
## 5  1988-01-05 13932  1988     5 2016-01-05
## 6  1988-01-06 13157  1988     6 2016-01-06
## 7  1988-01-07 11159  1988     7 2016-01-07
## 8  1988-01-08 11631  1988     8 2016-01-08
## 9  1988-01-09 12045  1988     9 2016-01-09
## 10 1988-01-10 13160  1988    10 2016-01-10
## # ... with 7,661 more rows
ggplot(flights, aes(days, value)) +
  geom_line(aes(group = year), alpha = .2) +
  geom_smooth(se = FALSE) +
  scale_x_date(labels = scales::date_format("%b")) +
  labs(x = NULL,
       y = "Number of departing commercial flights")

Or we could summarize the distribution of departing commercial flights by days in each month over the 20 year time period:

(flights <- flights %>%
  mutate(month = month(date, label = TRUE)))
## # A tibble: 7,671 × 6
##          date value  year  yday       days month
##        <date> <int> <dbl> <dbl>     <date> <ord>
## 1  1988-01-01 12681  1988     1 2016-01-01   Jan
## 2  1988-01-02 13264  1988     2 2016-01-02   Jan
## 3  1988-01-03 13953  1988     3 2016-01-03   Jan
## 4  1988-01-04 13921  1988     4 2016-01-04   Jan
## 5  1988-01-05 13932  1988     5 2016-01-05   Jan
## 6  1988-01-06 13157  1988     6 2016-01-06   Jan
## 7  1988-01-07 11159  1988     7 2016-01-07   Jan
## 8  1988-01-08 11631  1988     8 2016-01-08   Jan
## 9  1988-01-09 12045  1988     9 2016-01-09   Jan
## 10 1988-01-10 13160  1988    10 2016-01-10   Jan
## # ... with 7,661 more rows
ggplot(flights, aes(month, value)) +
  geom_violin() +
  geom_boxplot(width = .1, outlier.shape = NA) +
  labs(x = NULL,
       y = "Number of departing commercial flights")

Hmmm, there seems to be an outlier in September. What’s up with that?

Finally, we can generate a heatmap depicting the change over time of this data by creating a calendar-like visualization.1 In order do this, we need the following grammar for the graph:

  • Layer
    • Data - flights
    • Mapping
      • \(x\) - weekday (e.g. Sunday, Monday, Tuesday)
      • \(y\) - week in month (e.g. first week, second week, third week)
      • Fill - value (number of departing flights)
    • Statistical transformation (stat) - identity
    • Geometric object (geom) - geom_tile()
    • Position adjustment (position) - none
  • Scale
    • Fill - low and high-end colors (use shading to identify in-between values)
  • Coordinate system - Cartesian coordinate plane
  • Faceting - facet_grid() (year X month)

In order to generate this graph then, we need to create several new variables for flights:

  • Year
  • Month
  • Weekday
  • Week-in-month

We can use lubridate to directly generate three of those variables (we’ve already generated year and month):

(flights <- flights %>%
  mutate(weekday = wday(date, label = TRUE)))
## # A tibble: 7,671 × 7
##          date value  year  yday       days month weekday
##        <date> <int> <dbl> <dbl>     <date> <ord>   <ord>
## 1  1988-01-01 12681  1988     1 2016-01-01   Jan     Fri
## 2  1988-01-02 13264  1988     2 2016-01-02   Jan     Sat
## 3  1988-01-03 13953  1988     3 2016-01-03   Jan     Sun
## 4  1988-01-04 13921  1988     4 2016-01-04   Jan     Mon
## 5  1988-01-05 13932  1988     5 2016-01-05   Jan    Tues
## 6  1988-01-06 13157  1988     6 2016-01-06   Jan     Wed
## 7  1988-01-07 11159  1988     7 2016-01-07   Jan   Thurs
## 8  1988-01-08 11631  1988     8 2016-01-08   Jan     Fri
## 9  1988-01-09 12045  1988     9 2016-01-09   Jan     Sat
## 10 1988-01-10 13160  1988    10 2016-01-10   Jan     Sun
## # ... with 7,661 more rows

We use label = TRUE to generate factor labels for these values (January, February, March) instead of the numeric equivalent (1, 2, 3).

To generate the final week-in-month variable, we need to combine a few lubridate functions to get exactly what we want:

(flights <- flights %>%
  # generate variables for week in the year (1-54) and the day in the year (1-366)
  mutate(week = week(date),
         yday = yday(date)) %>%
  # normalize to draw calendar correctly - wday should represent the number of days from the Sunday of the week containing January 1st, then adjust based on that
  group_by(year) %>%
  mutate(yday = yday + wday(date)[1] - 2,
         week = floor(yday / 7)) %>%
  group_by(year, month) %>%
  mutate(week_month = week - min(week) + 1))
## Source: local data frame [7,671 x 9]
## Groups: year, month [252]
## 
##          date value  year  yday       days month weekday  week week_month
##        <date> <int> <dbl> <dbl>     <date> <ord>   <ord> <dbl>      <dbl>
## 1  1988-01-01 12681  1988     5 2016-01-01   Jan     Fri     0          1
## 2  1988-01-02 13264  1988     6 2016-01-02   Jan     Sat     0          1
## 3  1988-01-03 13953  1988     7 2016-01-03   Jan     Sun     1          2
## 4  1988-01-04 13921  1988     8 2016-01-04   Jan     Mon     1          2
## 5  1988-01-05 13932  1988     9 2016-01-05   Jan    Tues     1          2
## 6  1988-01-06 13157  1988    10 2016-01-06   Jan     Wed     1          2
## 7  1988-01-07 11159  1988    11 2016-01-07   Jan   Thurs     1          2
## 8  1988-01-08 11631  1988    12 2016-01-08   Jan     Fri     1          2
## 9  1988-01-09 12045  1988    13 2016-01-09   Jan     Sat     1          2
## 10 1988-01-10 13160  1988    14 2016-01-10   Jan     Sun     2          3
## # ... with 7,661 more rows

Now that we have the data correctly formatted and all the components are extracted, we can draw the graph:

ggplot(flights, aes(weekday, week_month, fill = value)) +
  facet_grid(year ~ month) +
  geom_tile(color = "black") +
  scale_fill_continuous(low = "green", high = "red") +
  scale_x_discrete(labels = NULL) +
  scale_y_reverse(labels = NULL) +
  labs(title = "Domestic commercial flight activity",
       x = NULL,
       y = NULL,
       fill = "Number of departing flights") +
  theme_void() +
  theme(legend.position = "bottom",
        legend.text = element_text(angle = 45))

Aha, now the outlier makes sense. In the days following the September 11th attacks, the United States grounded virtually all commercial air traffic.

Smoothing lines

When examining multivariate continuous data, scatterplots are a quick and easy visualization to assess relationships. However if the data points become too densely clustered, interpreting the graph becomes difficult. Consider the diamonds dataset:

p <- ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Carat size",
       y = "Price")
p

What is the relationship between carat size and price? It appears positive, but there are also a lot of densely packed data points in the middle of the graph. Smoothing lines are a method for summarizing the relationship between variables to capture important patterns by approximating the functional form of the relationship. The functional form can take on many shapes. For instance, a very common functional form is a best-fit line, also known as ordinary least squares (OLS) or simple linear regression. We can estimate the model directly using lm(), or we can directly plot the line by using geom_smooth(method = "lm"):

p +
  geom_smooth(method = "lm", se = FALSE)

The downside to a linear best-fit line is that it assumes the relationship between the variables is additive and monotonic. Therefore the summarized relationship between carat size and price seems wildly incorrect for diamonds with a carat size larger than 3. Instead we could use a generalized additive model which allow for flexible, non-linear relationships between the variables while still implementing a basic regression approach:2

p +
  geom_smooth(se = FALSE)

Locally weighted scatterplot smoothing (local regression, LOWESS, or LOESS) fits a separate non-linear function at each target point \(x_0\) using only the nearby training observations. This method estimates a regression line based on localized subsets of the data, building up the global function \(f\) point-by-point.

Here is an example of a local linear regression on the ethanol dataset in the lattice package:

The LOESS is built up point-by-point:

One important argument you can control with LOESS is the span, or how smooth the LOESS function will become. A larger span will result in a smoother curve, but may not be as accurate. A smaller span will be more local and wiggly, but improve our fit to the training data.

LOESS lines are best used for datasets with fewer than 1000 observations, otherwise the time and memory usage required to compute the line increases exponentially.

Coefficient of correlation (\(r\))

  • Produces a measure of association, known as Pearson’s \(r\), that gauges the direction and strength of a relationship between two continuous variables
  • Scales between \(-1\) and \(+1\)
  • \(-1\) – perfect negative association between the variables
  • \(+1\) – perfect positive association between the variables
  • \(0\) – no relationship between the variables
  • Unit-less measure - no matter what scale the variables fall on (e.g. turnout, education, income), the number will always fall between \(-1\) and \(+1\)
r_plot <- function(r, n = 100){
  xy <- ecodist::corgen(len = n, r = r) %>%
    bind_cols
  
  ggplot(xy, aes(x, y)) +
    geom_point() +
    ggtitle(str_c("Pearson's r = ", r))
}

r <- c(.8, 0, -.8)

for(r in r){
  print(r_plot(r))
}

Scatterplot matricies

To quickly visualize several variables in a dataset and their relation to one another, a scatterplot matrix is a quick and detailed tool for generating a series of scatterplots for each combination of variables. Consider credit.csv which contains a sample of individuals from a credit card company, identifying their total amount of credit card debt and other financial/demographic variables:

credit <- read_csv("data/Credit.csv") %>%
  # remove first ID column
  select(-X1)
names(credit) <- stringr::str_to_lower(names(credit))   # convert column names to lowercase
str(credit)
## Classes 'tbl_df', 'tbl' and 'data.frame':    400 obs. of  11 variables:
##  $ income   : num  14.9 106 104.6 148.9 55.9 ...
##  $ limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
##  $ rating   : int  283 483 514 681 357 569 259 512 266 491 ...
##  $ cards    : int  2 3 4 3 2 4 2 2 5 3 ...
##  $ age      : int  34 82 71 36 68 77 37 87 66 41 ...
##  $ education: int  11 15 11 11 16 10 12 9 13 19 ...
##  $ gender   : chr  "Male" "Female" "Male" "Female" ...
##  $ student  : chr  "No" "Yes" "No" "No" ...
##  $ married  : chr  "Yes" "Yes" "No" "No" ...
##  $ ethnicity: chr  "Caucasian" "Asian" "Asian" "Asian" ...
##  $ balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 12
##   .. ..$ X1       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Income   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Limit    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Rating   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Cards    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Age      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Education: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Gender   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Student  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Married  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Ethnicity: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Balance  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

If we want to quickly assess the relationship between all of the variables (in preparation for more advanced statistical learning techniques), we could generate a matrix of scatterplots using the base graphics package:

pairs(select_if(credit, is.numeric))

  • This only works well if we use strictly quantitative variables (hence the use of select_if())
  • We don’t automatically see correlation information
  • It’s not built using ggplot2 so it’s hard to modify using techniques with which we are already familiar

Instead, we can use GGally::ggpairs() to generate a scatterplot matrix. GGally is a package for R that extends ggplot2 by adding helper functions for common multivariate data structures. ggpairs() is a function that allows us to quickly generate a scatterplot matrix.

library(GGally)

ggpairs(select_if(credit, is.numeric))

When applied to strictly numeric variables, the lower triangle generates scatterplots, the upper triangle prints the correlation coefficient, and the diagonal panels are density plots of the variable.

Because ggpairs() is ultimately based on ggplot(), we can use the same types of commands to modify the graph. For instance, if we want to use the color aesthetic to distinguish between men and women in the dataset:

ggpairs(credit, mapping = aes(color = gender),
        columns = c("income", "limit", "rating", "cards", "age", "education", "balance"))

Or if we wanted to draw a smoothing line instead of scatterplots, we can modify the graph’s matrix sections:

ggpairs(select_if(credit, is.numeric),
        lower = list(
          continuous = "smooth"
        )
)

Hmm, too difficult to see the smoothers because the points are so dense. We can use wrap() to pass through individual parameters to the underlying geom_():

ggpairs(select_if(credit, is.numeric),
        lower = list(
          continuous = wrap("smooth", alpha = .1, color = "blue")
        )
)

Or we can write a custom function and apply it to the lower triangle panels:

scatter_smooth <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    # make data points transparent
    geom_point(alpha = .2) +
    # add default smoother
    geom_smooth(se = FALSE)
}

ggpairs(select_if(credit, is.numeric),
        lower = list(
          continuous = scatter_smooth
        )
)

ggpairs(credit, mapping = aes(color = gender),
        columns = c("income", "limit", "rating", "cards", "age", "education", "balance"),
        lower = list(
          continuous = scatter_smooth
        )
)

ggpairs() also works on datasets with a mix of qualitative and quantitative variables, drawing appropriate graphs based on whether the variables are continuous or discrete:

ggpairs(select(rcfss::scorecard, type:debt))

Heatmap of correlation coefficients

Scatterplot matricies can provide lots of information, but can also be very densely packed. Perhaps instead we want to quickly visualize the correlation between each of the variables.3 We can easily calculate the correlation coefficients using cor():

(mpg_lite <- select_if(mpg, is.numeric))
## # A tibble: 234 × 5
##    displ  year   cyl   cty   hwy
##    <dbl> <int> <int> <int> <int>
## 1    1.8  1999     4    18    29
## 2    1.8  1999     4    21    29
## 3    2.0  2008     4    20    31
## 4    2.0  2008     4    21    30
## 5    2.8  1999     6    16    26
## 6    2.8  1999     6    18    26
## 7    3.1  2008     6    18    27
## 8    1.8  1999     4    18    26
## 9    1.8  1999     4    16    25
## 10   2.0  2008     4    20    28
## # ... with 224 more rows
(cormat <- mpg_lite %>%
  cor %>%
  round(2))
##       displ  year   cyl   cty   hwy
## displ  1.00  0.15  0.93 -0.80 -0.77
## year   0.15  1.00  0.12 -0.04  0.00
## cyl    0.93  0.12  1.00 -0.81 -0.76
## cty   -0.80 -0.04 -0.81  1.00  0.96
## hwy   -0.77  0.00 -0.76  0.96  1.00

But who likes yucky tables. Instead let’s turn this into a heatmap. First we need to reshape the data into a tidy structure:

  1. Each row contains a single observation
  2. Each column contains a single variable
  3. Each cell contains a single value

What we need is a data frame with three columns:

  1. First variable name
  2. Second variable name
  3. Correlation coefficient

We can use reshape2::melt() to quickly accomplish this:

library(reshape2)
(melted_cormat <- melt(cormat))
##     Var1  Var2 value
## 1  displ displ  1.00
## 2   year displ  0.15
## 3    cyl displ  0.93
## 4    cty displ -0.80
## 5    hwy displ -0.77
## 6  displ  year  0.15
## 7   year  year  1.00
## 8    cyl  year  0.12
## 9    cty  year -0.04
## 10   hwy  year  0.00
## 11 displ   cyl  0.93
## 12  year   cyl  0.12
## 13   cyl   cyl  1.00
## 14   cty   cyl -0.81
## 15   hwy   cyl -0.76
## 16 displ   cty -0.80
## 17  year   cty -0.04
## 18   cyl   cty -0.81
## 19   cty   cty  1.00
## 20   hwy   cty  0.96
## 21 displ   hwy -0.77
## 22  year   hwy  0.00
## 23   cyl   hwy -0.76
## 24   cty   hwy  0.96
## 25   hwy   hwy  1.00

We can then use geom_tile() to visualize the correlation matrix:

ggplot(melted_cormat, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile()

Not exactly pretty. We can clean it up first by reducing redundancy (remember the upper and lower triangles provide duplicate information):

# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}

# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}

upper_tri <- get_upper_tri(cormat)
upper_tri
##       displ year  cyl   cty   hwy
## displ     1 0.15 0.93 -0.80 -0.77
## year     NA 1.00 0.12 -0.04  0.00
## cyl      NA   NA 1.00 -0.81 -0.76
## cty      NA   NA   NA  1.00  0.96
## hwy      NA   NA   NA    NA  1.00

Now melt upper_tri and repeat the same process, cleaning up the colors for the heatmap as well to distinguish between positive and negative coefficients:

melted_cormat <- melt(upper_tri, na.rm = TRUE)

ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white") +
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1)) +
 coord_fixed()

We can also reorder the correlation matrix according to correlation coefficient to help reveal additional trends:

reorder_cormat <- function(cormat){
  # Use correlation between variables as distance
  dd <- as.dist((1-cormat)/2)
  hc <- hclust(dd)
  cormat <-cormat[hc$order, hc$order]
}

# Reorder the correlation matrix
cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)

# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)

# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal()+ # minimal theme
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 12, hjust = 1))+
  coord_fixed()

# Print the heatmap
print(ggheatmap)

Finally we can directly label the correlation coefficient values on the graph, so we have both the color channel and exact values:

ggheatmap + 
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.ticks = element_blank(),
    legend.position = "bottom")

To make it more flexible, we can also turn all of this into a function that works for any dataset:

cormat_heatmap <- function(data){
  # generate correlation matrix
  cormat <- round(cor(data), 2)
  
  # melt into a tidy table
  get_upper_tri <- function(cormat){
    cormat[lower.tri(cormat)]<- NA
    return(cormat)
  }
  
  upper_tri <- get_upper_tri(cormat)
  
  # reorder matrix based on coefficient value
  reorder_cormat <- function(cormat){
    # Use correlation between variables as distance
    dd <- as.dist((1-cormat)/2)
    hc <- hclust(dd)
    cormat <-cormat[hc$order, hc$order]
  }
  
  cormat <- reorder_cormat(cormat)
  upper_tri <- get_upper_tri(cormat)
  
  # Melt the correlation matrix
  melted_cormat <- melt(upper_tri, na.rm = TRUE)
  
  # Create a ggheatmap
  ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
    geom_tile(color = "white")+
    scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                         midpoint = 0, limit = c(-1,1), space = "Lab", 
                         name="Pearson\nCorrelation") +
    theme_minimal()+ # minimal theme
    theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                     size = 12, hjust = 1))+
    coord_fixed()
  
  # add correlation values to graph
  ggheatmap + 
    geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
    theme(
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      panel.grid.major = element_blank(),
      panel.border = element_blank(),
      panel.background = element_blank(),
      axis.ticks = element_blank(),
      legend.position = "bottom")
}

cormat_heatmap(select_if(mpg, is.numeric))

cormat_heatmap(select_if(credit, is.numeric))

cormat_heatmap(select_if(diamonds, is.numeric))

Parallel coordinate plots

Parallel coordinate plots are an alternative graphical format for multivariate data analysis (continuous or discrete). They can be quite busy and messy. Key things for parallel coordinate plots:

  • Ordering variables in different ways helps to identify relevant patterns. Therefore a lot of this is trial and error
  • Adding interactivity (as we will see in later weeks) helps
ggparcoord(data = iris, columns = 1:4, groupColumn = 5)

# with the iris data, order the axes by overall class (Species) separation
# using the anyClass option
ggparcoord(data = iris, columns = 1:4, groupColumn = 5, order = "anyClass")

# add points to the plot, add a title, and use an alpha scalar to make the
# lines transparent
p <- ggparcoord(data = iris, columns = 1:4, groupColumn = 5, order = "anyClass", 
    showPoints = TRUE, title = "Parallel Coordinate Plot for the Iris Data", 
    alphaLines = 0.3)
p

# add some basic interactivity
ggplotly(p)

Representing three-dimensional data in three physical dimensions

Adding a third (or fourth) dimension to a two-dimensional plot is relatively trivial when at least one of the variables is discrete:

ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()

However what happens when you have three continuous dimensions to represent? Can we draw 3D graphs in R? Not easily, and interpreting 3D graphs can also be challenging mentally. ggplot2 cannot draw graphs in three dimensions. One possibility is to keep the data in two physical dimensions by using geom_tile() and adding the third dimension using the fill aesthetic (color channel). For example, say we estimate a logistic regression model of the probability of voting in the 1996 US presidential election and we want to visualize the predicted probability of survival for each combination of these variable:

# import data
(vote <- rcfss::mental_health)
## # A tibble: 1,317 × 5
##    vote96   age  educ female mhealth
##     <dbl> <dbl> <dbl>  <dbl>   <dbl>
## 1       1    60    12      0       0
## 2       1    36    12      0       1
## 3       0    21    13      0       7
## 4       0    29    13      0       6
## 5       1    39    18      1       2
## 6       1    41    15      1       1
## 7       1    48    20      0       2
## 8       0    20    12      1       9
## 9       0    27    11      1       9
## 10      0    34     7      1       2
## # ... with 1,307 more rows
# estimate model
vote_glm <- glm(vote96 ~ age + educ, data = vote, family = "binomial")
tidy(vote_glm)
##          term estimate std.error statistic  p.value
## 1 (Intercept)  -5.0244   0.44482     -11.3 1.38e-29
## 2         age   0.0469   0.00441      10.6 1.94e-26
## 3        educ   0.2816   0.02629      10.7 9.32e-27
# extract predicted probabilities
vote_prob <- vote %>%
  data_grid(age = 18:89, educ = 0:20) %>%
  modelr::add_predictions(vote_glm) %>%
  # convert predicted values to probabilities
  mutate(prob = rcfss::logit2prob(pred))

ggplot(vote_prob, aes(age, educ, fill = prob)) +
  geom_tile() +
  scale_fill_gradient2(midpoint = .5, label = scales::percent) +
  labs(title = "Probability of voter turnout in 1996",
       x = "Age",
       y = "Education (in years)",
       fill = "Probability\nof voting")

# cleaner image using geom_raster and interpolate = TRUE
ggplot(vote_prob, aes(age, educ, fill = prob)) +
  geom_raster(interpolate = TRUE) +
  scale_fill_gradient2(midpoint = .5, label = scales::percent) +
  labs(title = "Probability of voter turnout in 1996",
       x = "Age",
       y = "Education (in years)",
       fill = "Probability\nof voting")

3D scatter plot with mesh

If we wanted to represent this in true three-dimensional fashion, we could use plotly:

plot_ly(vote_prob, x = ~age, y = ~educ, z = ~prob) %>%
  add_mesh()
plot_ly(credit, x = ~limit, y = ~balance, z = ~income) %>%
  add_mesh()

3D surface plot

plot_ly(z = ~volcano) %>% add_surface()
volcano %>%
  melt %>%
  ggplot(aes(Var1, Var2, z = value)) +
  geom_contour(aes(color = ..level..))

Session Info

devtools::session_info()
##  setting  value                       
##  version  R version 3.3.3 (2017-03-06)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2017-04-19                  
## 
##  package     * version  date       source         
##  assertthat    0.1      2013-12-06 CRAN (R 3.3.0) 
##  backports     1.0.5    2017-01-18 CRAN (R 3.3.2) 
##  base64enc     0.1-3    2015-07-28 CRAN (R 3.3.0) 
##  broom       * 0.4.2    2017-02-13 CRAN (R 3.3.2) 
##  colorspace    1.3-2    2016-12-14 CRAN (R 3.3.2) 
##  DBI           0.6      2017-03-09 CRAN (R 3.3.3) 
##  devtools      1.12.0   2016-06-24 CRAN (R 3.3.0) 
##  digest        0.6.12   2017-01-27 CRAN (R 3.3.2) 
##  dplyr       * 0.5.0    2016-06-24 CRAN (R 3.3.0) 
##  evaluate      0.10     2016-10-11 CRAN (R 3.3.0) 
##  forcats       0.2.0    2017-01-23 CRAN (R 3.3.2) 
##  foreign       0.8-67   2016-09-13 CRAN (R 3.3.3) 
##  ggplot2     * 2.2.1    2016-12-30 CRAN (R 3.3.2) 
##  ggthemes    * 3.4.0    2017-02-19 CRAN (R 3.3.2) 
##  gtable        0.2.0    2016-02-26 CRAN (R 3.3.0) 
##  haven         1.0.0    2016-09-23 cran (@1.0.0)  
##  hms           0.3      2016-11-22 CRAN (R 3.3.2) 
##  htmltools     0.3.5    2016-03-21 CRAN (R 3.3.0) 
##  htmlwidgets   0.8      2016-11-09 CRAN (R 3.3.1) 
##  httr          1.2.1    2016-07-03 CRAN (R 3.3.0) 
##  jsonlite      1.3      2017-02-28 CRAN (R 3.3.2) 
##  knitr       * 1.15.1   2016-11-22 cran (@1.15.1) 
##  lattice       0.20-34  2016-09-06 CRAN (R 3.3.3) 
##  lazyeval      0.2.0    2016-06-12 CRAN (R 3.3.0) 
##  lubridate     1.6.0    2016-09-13 CRAN (R 3.3.0) 
##  magrittr      1.5      2014-11-22 CRAN (R 3.3.0) 
##  memoise       1.0.0    2016-01-29 CRAN (R 3.3.0) 
##  mnormt        1.5-5    2016-10-15 CRAN (R 3.3.0) 
##  modelr      * 0.1.0    2016-08-31 CRAN (R 3.3.0) 
##  munsell       0.4.3    2016-02-13 CRAN (R 3.3.0) 
##  nlme          3.1-131  2017-02-06 CRAN (R 3.3.3) 
##  plotly      * 4.5.6    2016-11-12 CRAN (R 3.3.2) 
##  plyr          1.8.4    2016-06-08 CRAN (R 3.3.0) 
##  psych         1.7.3.21 2017-03-22 CRAN (R 3.3.2) 
##  purrr       * 0.2.2    2016-06-18 CRAN (R 3.3.0) 
##  R6            2.2.0    2016-10-05 CRAN (R 3.3.0) 
##  Rcpp          0.12.10  2017-03-19 cran (@0.12.10)
##  readr       * 1.1.0    2017-03-22 cran (@1.1.0)  
##  readxl        0.1.1    2016-03-28 CRAN (R 3.3.0) 
##  reshape2      1.4.2    2016-10-22 CRAN (R 3.3.0) 
##  rmarkdown     1.3      2016-12-21 CRAN (R 3.3.2) 
##  rprojroot     1.2      2017-01-16 CRAN (R 3.3.2) 
##  rvest         0.3.2    2016-06-17 CRAN (R 3.3.0) 
##  scales        0.4.1    2016-11-09 CRAN (R 3.3.1) 
##  stringi       1.1.2    2016-10-01 CRAN (R 3.3.0) 
##  stringr     * 1.2.0    2017-02-18 CRAN (R 3.3.2) 
##  tibble      * 1.2      2016-08-26 cran (@1.2)    
##  tidyr       * 0.6.1    2017-01-10 CRAN (R 3.3.2) 
##  tidyverse   * 1.1.1    2017-01-27 CRAN (R 3.3.2) 
##  viridisLite   0.1.3    2016-03-12 cran (@0.1.3)  
##  withr         1.0.2    2016-06-20 CRAN (R 3.3.0) 
##  xml2          1.1.1    2017-01-24 CRAN (R 3.3.2) 
##  yaml          2.1.14   2016-11-12 cran (@2.1.14)

  1. A la U.S. Commercial Flights, 1995-2008.

  2. geom_smooth() automatically implements the gam method for datasets with greater than 1000 observations.

  3. Example drawn from ggplot2 : Quick correlation matrix heatmap - R software and data visualization.